Add clustered migration sync for shared disks (SYNCING barrier) #403
fabi200123 wants to merge 1 commit into cloudbase:master
coriolis/conductor/rpc/server.py
Outdated
shared disks. Once the owner replicate task completes, any
waiting (SYNCING) replicate tasks are moved back to SCHEDULED
so they can continue their normal flow.
"""
As far as I understand, this will block the rest of the cluster while the main instance gets transferred. That is fine for shared disks, but what I proposed initially was to have all instances running in parallel once the syncing tasks (which were not meant to include REPLICATE_DISKS, by the way) are done. You can add SYNCING on this task alone while waiting for the rest of the tasks to complete (tasks like DEPLOY_REPLICA_DISKS, SHUTDOWN_INSTANCE); that is also feasible, but please set the volumes_info accordingly instead of blocking all of them.
What I originally envisioned was something like this:
instance1 has: root_disk1, shared_disk1;
instance2 has: root_disk2, shared_disk1.
What's the point of blocking instance2 while instance1 is replicating, when you can have the following:
instance1 replicates root_disk1, shared_disk1;
instance2 replicates root_disk2, skips shared_disk1, based on the volumes_info that you can set up before launching the replicate_disks task.
Please let me know if there's anything preventing us from doing a parallel sync, with shared_disks being transferred by only one of the instances.
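For illustration, the parallel plan described above could be sketched roughly as follows. This is not code from the PR: the function name `assign_shared_disk_owners` and the `disk_id` key are hypothetical, and the only name taken from this thread is the `replicate_disk_data` flag. The idea is that the first instance seen with a given disk becomes its owner, and every other instance keeps the volume entry but is told to skip the data transfer.

```python
def assign_shared_disk_owners(volumes_by_instance):
    """Mark each disk for replication by exactly one instance.

    volumes_by_instance maps an instance name to a list of volume dicts.
    The "disk_id" key is a hypothetical identifier used for this sketch.
    Returns a new mapping; the input is not modified.
    """
    owners = {}  # disk_id -> first instance seen holding that disk
    result = {}
    for instance, volumes in volumes_by_instance.items():
        updated = []
        for vol in volumes:
            owner = owners.setdefault(vol["disk_id"], instance)
            # Only the owner transfers the data; peers skip the disk.
            updated.append(dict(vol, replicate_disk_data=(owner == instance)))
        result[instance] = updated
    return result


plan = assign_shared_disk_owners({
    "instance1": [{"disk_id": "root_disk1"}, {"disk_id": "shared_disk1"}],
    "instance2": [{"disk_id": "root_disk2"}, {"disk_id": "shared_disk1"}],
})
```

With this input, instance1 replicates both of its disks, while instance2 replicates only root_disk2, so both replicate_disks tasks can run in parallel.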
8152e8e to 9813d2b
Introduce TASK_STATUS_SYNCING and TASK_TYPES_TO_SYNC (GET_INSTANCE_INFO, DEPLOY_TRANSFER_DISKS, REPLICATE_DISKS) so multi-instance transfers with base_transfer_action.clustered=True wait for all peer tasks of the same type before leaving SYNCING for COMPLETED and advancing dependents.
- clustered is set as len(instances) > 1 on transfer create
- On task_completed: enter SYNCING when the barrier applies; then, when every peer is SYNCING, run sync hooks (GET_INSTANCE_INFO: promote shareable on export disks, DEPLOY_TRANSFER_DISKS: shared-disk volumes_info, REPLICATE_DISKS: sync change_id)
- ReplicateDisksTask: skip provider replicate for replicate_disk_data=False
- On task error: abort peers stuck in SYNCING for the same task type
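A minimal sketch of the barrier described in this commit message, under stated assumptions: tasks are modelled as plain dicts rather than the conductor's DB objects, the status strings (including the "CANCELED" used for aborted peers) are placeholders, and `on_task_completed`, `on_task_error`, and `sync_hook` are hypothetical names, not the PR's actual functions.

```python
# Status strings and the task-dict layout are assumptions for this sketch.
TASK_STATUS_RUNNING = "RUNNING"
TASK_STATUS_SYNCING = "SYNCING"
TASK_STATUS_COMPLETED = "COMPLETED"
TASK_STATUS_CANCELED = "CANCELED"

TASK_TYPES_TO_SYNC = {
    "GET_INSTANCE_INFO", "DEPLOY_TRANSFER_DISKS", "REPLICATE_DISKS"}


def _peers(tasks, task):
    """All tasks of the same type, including the given task itself."""
    return [t for t in tasks.values() if t["task_type"] == task["task_type"]]


def on_task_completed(tasks, task_id, clustered, sync_hook=None):
    """Apply the barrier; return the ids of tasks that became COMPLETED."""
    task = tasks[task_id]
    if not clustered or task["task_type"] not in TASK_TYPES_TO_SYNC:
        task["status"] = TASK_STATUS_COMPLETED
        return [task_id]

    # Park this task at the barrier until every peer has arrived.
    task["status"] = TASK_STATUS_SYNCING
    peers = _peers(tasks, task)
    if any(p["status"] != TASK_STATUS_SYNCING for p in peers):
        return []

    # Last peer arrived: run the per-type hook once, then release everyone.
    if sync_hook is not None:
        sync_hook(peers)
    for p in peers:
        p["status"] = TASK_STATUS_COMPLETED
    return sorted(p["id"] for p in peers)


def on_task_error(tasks, task_id):
    """Abort peers of the failed task that are still waiting in SYNCING."""
    aborted = []
    for p in _peers(tasks, tasks[task_id]):
        if p["id"] != task_id and p["status"] == TASK_STATUS_SYNCING:
            p["status"] = TASK_STATUS_CANCELED
            aborted.append(p["id"])
    return sorted(aborted)
```

In this sketch dependents would be advanced only for the ids returned by `on_task_completed`, so nothing downstream starts until the whole peer group has passed the barrier.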
Introduce TASK_STATUS_SYNCING and TASK_TYPES_REQUIRING_CLUSTER_SYNC (DEPLOY_TRANSFER_DISKS, SHUTDOWN_INSTANCE) so multi-instance transfers with base_transfer_action.clustered=True wait for all peer tasks before marking COMPLETED and advancing dependents.
Volumes schema already allows extra properties; replicate_disk_data is consumed by replication only (default True preserves behavior).
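The default described here could be read on the consumer side roughly as below. This is a sketch only: `volumes_to_replicate` is a hypothetical helper, and the `disk_id` key in the usage is illustrative. A missing `replicate_disk_data` key is treated as True, which is what keeps existing volumes_info entries behaving as before.

```python
def volumes_to_replicate(volumes_info):
    """Return the volumes whose data should actually be transferred.

    A volume without the "replicate_disk_data" key is replicated, so
    entries created before the flag existed keep their old behavior.
    """
    return [v for v in volumes_info if v.get("replicate_disk_data", True)]


selected = volumes_to_replicate([
    {"disk_id": "root_disk2"},  # no flag: replicated by default
    {"disk_id": "shared_disk1", "replicate_disk_data": False},  # skipped
])
```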